Abstract

Single-sample face recognition is one of the most challenging
problems in face recognition. We propose a novel face recognition
algorithm to address this problem based on a sparse representation
based classification (SRC) framework. The new algorithm is robust to
image misalignment and pixel corruption, and is able to reduce the
required training images to one sample per class. To compensate for
the missing illumination information typically provided by multiple
training images, a sparse illumination transfer (SIT) technique is
introduced. The SIT algorithms seek additional illumination examples
of face images from one or more additional subject classes, and form
an illumination dictionary. By enforcing a sparse representation of
the query image, the method can recover and transfer the pose and
illumination information from the alignment stage to the recognition
stage. Our extensive experiments have demonstrated that the new
algorithms significantly outperform the existing algorithms in the
single-sample regime and with fewer restrictions. In particular, the
face alignment accuracy is comparable to that of the well-known
Deformable SRC algorithm using multiple training images, and the face
recognition accuracy exceeds those of the SRC and Extended SRC
algorithms using hand-labeled alignment initialization.

1.  Introduction 

Face recognition is one of the classical problems in computer
vision. Given a natural image that may contain a human face, it has
been known that the appearance of the face image can be easily
affected by many image nuisances, including background illumination,
pose, and facial corruption/disguise such as makeup, beard, and
glasses. Hence, to develop a face recognition system whose
performance can be comparable to or even exceed that of human vision,
the computer system needs to address at least the following three
closely related problems: First, it needs to effectively model the
change of illumination on the human face. Second, it needs to align
the pose of the face. Third, it needs to tolerate the corruption of
facial features that leads to potential gross pixel error against the
training images.

In  the  literature,  many  well-known  solutions  have  been 
studied  to  tackle  these  problems  [13][32.  00,  although  a 
complete review of the field is outside the scope of this paper.
More recently, a new face recognition framework called
sparse-representation based classification (SRC) was proposed [26],
which can successfully address most of the
above  problems.  The  framework  is  built  on  a  subspace 
illumination  model  characterizing  the  distribution  of  a 
corruption-free  face  image  sample  (stacked  in  vector  form) 
under  a  fixed  pose,  one  subspace  model  per  subject  class 
mm.  When  an  unknown  query  image  is  jointly  represented 
by  all  the  subspace  models,  only  a  small  subset  of  these 
subspace  coefficients  need  to  be  nonzero,  which  would  pri¬ 
marily  correspond  to  the  subspace  model  of  the  true  sub¬ 
ject.  Therefore,  by  optimizing  the  sparsity  of  such  an  over¬ 
complete  linear  representation,  the  dominant  nonzero  coef¬ 
ficients  indicate  the  identity  of  the  query  image.  In  the  case 
of  image  corruption,  since  the  corruption  typically  only  af¬ 
fects  a  sparse  set  of  pixel  values,  one  can  concurrently  opti¬ 
mize  a  sparse  error  term  in  the  image  space  to  compensate 
for  the  corrupted  pixel  values. 

In  practice,  a  face  image  may  appear  at  any  image  lo¬ 
cation  with  random  background.  Hence,  a  face  detection 
and  registration  step  is  typically  first  used  to  detect  the  face 
image.  Most  of  the  methods  in  face  detection  would  learn 
a  class  of  local  image  features/patches  that  are  sensitive  to
the  appearance  of  key  facial  features  t27l  l23l  fTTl.  Using 
either  an  active  shape  model  0  or  an  active  appearance 
model  the  location  of  the  face  can  be  detected  even 
when  the  expression  of  the  face  is  not  neutral  or  some  fa¬ 
cial  features  are  occluded  ET1 [12] .  However,  using  these 
face  registration  algorithms  alone  is  not  sufficient  to  align 
a query image to training images for SRC. The main reasons are
two-fold: First, except for some fast detectors such as Viola-Jones
[23], more sophisticated detectors are expensive to run and require
learning a prior distribution of the
shape  model  from  meticulously  hand-labeled  training  im¬ 
ages.  More  importantly,  these  detectors  would  register  the 
pixel  values  of  the  query  image  with  respect  to  the  average 
shape  model  learned  from  all  the  training  images,  but  they 
typically  cannot  align  the  pixel  values  of  the  query  image 
to  the  training  images  for  the  purpose  of  recognition,  as  re¬ 
quired  in  SRC. 

Following the sparse representation framework in [26, 24], we
propose a novel algorithm to effectively extend SRC for face
alignment and recognition in the small sample set scenario. We
observe that in addition to the well-understood image nuisances
aforementioned, one of the remaining challenges in face recognition
is indeed the small sample set problem. For instance, in many
biometric, surveillance, and Internet applications, there may be only
a few training examples per subject that are collected in the wild,
and the subjects of interest may not be able to undergo an extended
image collection session in a laboratory.^1

Unfortunately, most of the existing SRC-based alignment and
recognition algorithms would fail in such scenarios. For starters,
the original SRC algorithm [26] assumes that a plurality of training
samples from each class sufficiently span its illumination subspace.
The algorithm would perform poorly in the single-sample regime, as we
will show in our experiment later. In [24], in order to guarantee
that the training images contain sufficient illumination patterns,
the test subjects must further go through a nontrivial passport-style
image collection process in a dark room in order to be entered into
the training database. More recently,
another  development  in  the  SRC  framework  is  simultane¬ 
ous  face  alignment  and  recognition  methods  Ell  [15]  |30). 
Nevertheless, these methods did not go beyond the basic assumption
used in SRC and other prior art that the face illumination model is
measured by a plurality of training samples for each class.
Furthermore, as shown in [24], robust face alignment and recognition
can be solved separately as a two-step process, as long as the
recovered image transformation can be carried over from the
alignment stage to the

1 In this paper, we use the Viola-Jones face detector to initialize
the face image location. As a result, we do not consider scenarios
where the face may contain a large 3D transformation or large
expression change. These more severe conditions can be addressed in
the face detection stage using more sophisticated face models as we
mentioned above.


recognition  stage.  Therefore,  simultaneous  face  alignment 
and  recognition  could  make  the  already  expensive  sparse 
optimization  problem  even  more  difficult  to  solve. 

1.1.  Contributions 

Single-sample face alignment and recognition represents
an  important  step  towards  practical  face  recognition  solu¬ 
tions  using  images  collected  in  the  wild  or  on  the  Internet. 
We  contend  that  the  problem  can  be  solved  quite  effectively 
by  a  simple  yet  elegant  algorithm.  The  key  observation  is 
that  one  sample  per  class  mainly  deprives  the  algorithm  of 
an  illumination  subspace  model  for  each  individual  class. 
We show that a sparse illumination transfer (SIT) dictionary can be
constructed to compensate for the lack of the illumination
information in the training set. Due to the fact
that  most  human  faces  have  similar  shapes,  only  one  sub¬ 
ject  is  often  sufficient  to  provide  images  of  different  illu¬ 
mination  patterns,  although  adding  more  subjects  may  fur¬ 
ther  improve  the  accuracy.  The  subject(s)  for  illumination 
transfer  can  be  selected  outside  the  set  of  training  subjects 
for  recognition.  Finally,  we  show  that  the  other  image  nui¬ 
sances,  including  pose  variation  and  image  corruption,  can 
be  readily  corrected  by  a  single  reference  image  of  arbitrary 
illumination  condition  per  class  combined  with  the  SIT  dic¬ 
tionary.  The  SIT  dictionary  also  does  not  need  to  know  the 
information  of  any  possible  facial  corruption  for  the  algo¬ 
rithm  to  be  robust.  To  the  best  of  our  knowledge,  this  work 
is  the  first  to  propose  a  solution  to  perform  facial  illumina¬ 
tion  compensation  in  the  alignment  stage  and  illumination 
and  pose  transfer  in  the  recognition  stage. 

In  terms  of  the  algorithm  complexity,  the  construction  of 
the  SIT  dictionary  is  extremely  simple  when  the  illumina¬ 
tion  data  of  the  SIT  subject(s)  are  provided,  and  it  does  not 
necessarily  involve  any  dictionary  learning  algorithm.  The 
algorithm  is  also  fast  to  execute  in  the  alignment  and  recog¬ 
nition  stages  compared  to  the  other  SRC-type  algorithms 
because a sparse optimization solver such as those in [29] is
now  faced  with  much  smaller  linear  systems. 

This paper bears resemblance to the work called Extended SRC [6],
whereby an intraclass variant dictionary was similarly added to be a
part of the SRC objective function for recognition. Our work differs
from [6] in that the proposed SIT dictionary can be constructed from
a selection of independent subject(s) only for the purpose of
illumination transfer. As a result, the SIT dictionary is impartial
to the training classes. Furthermore, by transferring both the pose
and illumination from the alignment stage to the recognition stage,
our algorithm can handle insufficient illumination and misalignment
at the same time, and allows for the single reference images to have
arbitrary illumination conditions. Finally, our algorithm is also
robust to moderate amounts of image pixel corruption, even though we
do not need to include any image corruption examples in the SIT
dictionary, while in [6] the intraclass variant dictionary uses both
normal and corrupted face samples. We also compare our performance
with [6] in Section 4.

2.  Sparse  Representation-based  Classification 

In  this  section,  we  first  briefly  review  the  SRC  formula¬ 
tion  and  introduce  the  notation. 

Assume a face image $b \in \mathbb{R}^d$ in grayscale can be written
in vector form by stacking its pixels. In the training stage, given
$L$ training subject classes, assume $n_i$ well-aligned training
images $A_i = [a_{i,1}, a_{i,2}, \cdots, a_{i,n_i}] \in
\mathbb{R}^{d \times n_i}$ of the same dimension as $b$ are sampled
for the $i$-th class under the frontal position and various
illumination conditions. These training images are further aligned in
terms of the coordinates of some salient facial features, e.g., eye
corners and/or mouth corners. For brevity, the training images under
such conditions are said to be in the neutral position. Furthermore,
we do not consider facial expression change in this paper. Based on
the illumination subspace assumption, if $b$ belongs to the $i$-th
class, then $b$ lies in the low-dimensional subspace spanned by the
training images in $A_i$, namely,

$$ b = A_i x_i. \tag{1} $$

In the query stage, the query image $b$ may contain an unknown 3D
pose that is different from the neutral position. In image
registration literature [18, 13, 24], an image transformation can be
modeled in the image domain as $\tau \in T$, where $T$ is a
finite-dimensional group of transformations, such as translation,
similarity transform, and homography. The goal of the alignment is to
recover the transformation $\tau$, such that an unwarped query image
$b_0$ of the same subject in the neutral position can be written as
$b_0 = b \circ \tau = A_i x_i$.

In robust face alignment, the issue is often further exacerbated by
the cascade of complex illumination patterns and moderate image pixel
corruption and occlusion. In the SRC framework [26, 24], the combined
effect of image misalignment and sparse corruption is modeled by

$$ \hat{\tau}_i = \arg\min_{x_i, e, \tau_i} \|e\|_1 \quad \text{subj. to} \quad b \circ \tau_i = A_i x_i + e, \tag{2} $$

where the alignment is achieved on a per-class basis for each $A_i$,
and $e \in \mathbb{R}^d$ is the sparse alignment error, whose
$\ell_1$-norm is the objective function. After linearizing the
nonlinear image transformation function $\tau$, (2) can be solved
iteratively by a standard $\ell_1$-minimization solver. In [24], it
was shown that the alignment based on (2) can tolerate translation
shift up to 20% of the between-eye distance and up to 30° in-plane
rotation, which is typically sufficient to compensate for moderate
misalignment caused by a good face detector.

Once the optimal transformation $\hat{\tau}_i$ is recovered for each
class $i$, the transformation is carried over to the recognition
algorithm, where the training images in each $A_i$ are transformed by
$\hat{\tau}_i^{-1}$ to align with the query image $b$. Finally, a
global sparse representation $x$ with respect to the transformed
training images is sought by solving the following sparse
optimization problem:

$$ x^* = \arg\min_{x, e} \|x\|_1 + \|e\|_1 \quad \text{subj. to} \quad b = [A_1 \circ \hat{\tau}_1^{-1}, \cdots, A_L \circ \hat{\tau}_L^{-1}]\, x + e. \tag{3} $$

One can further show that when the correlation of the face samples in
$A$ is sufficiently tight in the high-dimensional image space,
solving (3) via $\ell_1$-minimization is guaranteed to recover both
the sparse coefficients $x$ and a very dense (sparsity $\rho \nearrow
1$) randomly signed error $e$ [25].
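
The recognition step in (3) can be illustrated with a minimal sketch, under several loud assumptions: the constrained problem is replaced by a penalized relaxation solved with plain iterative soft-thresholding (ISTA), which is not one of the solvers cited in this paper; the training images are assumed to be already aligned, so no warping is performed; and the names `soft`, `src_ista`, the weight `mu`, and the toy data are hypothetical.

```python
import math
import random

def soft(v, t):
    """Soft-thresholding: the proximal operator of t * |v|."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def src_ista(cols, b, mu=0.01, iters=500):
    """Approximate min_{x,e} 0.5*||b - A x - e||^2 + mu*(||x||_1 + ||e||_1).

    cols: list of dictionary columns of A (each a list of d floats).
    This is a penalized stand-in for (3), not the constrained problem itself.
    """
    d, m = len(b), len(cols)
    # Valid step size: ||[A, I]||^2 = 1 + ||A||^2 <= 1 + ||A||_F^2.
    step = 1.0 / (1.0 + sum(a * a for col in cols for a in col))
    x, e = [0.0] * m, [0.0] * d
    for _ in range(iters):
        # Residual of the linear model, r = A x + e - b.
        r = [sum(cols[j][k] * x[j] for j in range(m)) + e[k] - b[k]
             for k in range(d)]
        # Gradient step on both blocks, then shrink toward sparsity.
        x = [soft(x[j] - step * sum(cols[j][k] * r[k] for k in range(d)),
                  step * mu) for j in range(m)]
        e = [soft(e[k] - step * r[k], step * mu) for k in range(d)]
    return x, e

# Toy data (hypothetical): 4 unit-norm "training images" of 40 pixels.
rng = random.Random(0)
d, m = 40, 4
cols = []
for _ in range(m):
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    n = math.sqrt(sum(t * t for t in v))
    cols.append([t / n for t in v])

# Query: the class-2 image with one grossly corrupted pixel.
b = list(cols[2])
b[5] += 0.8

x, e = src_ista(cols, b)
predicted = max(range(m), key=lambda j: abs(x[j]))
print(predicted, [round(t, 2) for t in x])
```

In this toy setting the dominant coefficient should fall on the true class while the corrupted pixel is absorbed by `e`; a practical system would use a dedicated $\ell_1$ solver and the warped dictionary instead.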

3.  Sparse  Illumination  Transfer 
3.1.  Single-Sample  Alignment 

In  this  section,  we  first  propose  a  novel  face  alignment 
algorithm  that  is  effective  even  when  a  very  small  number 
of  training  images  are  provided  per  class.  In  the  extreme 
case, we specifically consider the single-sample face alignment
problem where only one training image of arbitrary
illumination  is  available  from  Class  i.  The  same  algorithm 
easily  extends  to  the  case  when  multiple  training  images  are 
provided. 

To  mitigate  the  scarcity  of  the  training  images,  some¬ 
thing  has  to  give  to  recover  the  missing  illumination  model 
under  which  the  image  appearance  of  a  human  face  can 
be  affected.  Motivated  by  the  idea  of  transfer  learning 
mmm,  we  stipulate  that  one  can  obtain  the  illumina¬ 
tion  information  for  both  alignment  and  recognition  from  a 
set of additional subject classes, called the illumination
dictionary. The additional face images have the same frontal pose as
the training images, can be collected offline, and can be different
from the query classes $A = [A_1, \cdots, A_L]$.
In  other  words,  no  matter  how  scarce  the  training  images 
of  the  query  classes  are,  one  can  always  obtain  a  potentially 
large  set  of  additional  face  images  of  unrelated  subjects  who 
may  have  similar  face  shapes  as  the  query  subjects  and  may 
provide  sufficient  illumination  examples. 

The illumination dictionary for an additional class $L+1$ is defined
as follows. Assume face images of sufficient illumination patterns
$(a_{L+1,1}, a_{L+1,2}, \cdots, a_{L+1,n}) = (c_1, c_2, \cdots, c_n)$
are sampled from the class, and further assume all images in vector
form are normalized to have unit length. Then the illumination
dictionary for the $(L+1)$-th subject can be written as the
differences of pairs of face images of the same shape:

$$ C_1 = [c_2 - c_1, \cdots, c_n - c_1]. \tag{4} $$

Multiplying $C_1$ by a vector $y$ can further generate more complex
illumination patterns that involve multiple images in the columns of
$C_1$.
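
The construction in (4) can be sketched in a few lines. This is only an illustration under stated assumptions: images are plain Python lists of pixel values, the first image plays the role of $c_1$, and the helper names `unit_normalize`, `illumination_dictionary`, and `apply_dictionary` as well as the toy values are hypothetical.

```python
import math

def unit_normalize(v):
    """Scale a vectorized image to unit Euclidean length."""
    n = math.sqrt(sum(t * t for t in v))
    return [t / n for t in v]

def illumination_dictionary(images):
    """Return the columns [c_2 - c_1, ..., c_n - c_1] of C_1 in (4).

    images: list of n vectorized images of ONE subject; images[0] plays c_1.
    """
    c = [unit_normalize(img) for img in images]
    return [[ck[p] - c[0][p] for p in range(len(ck))] for ck in c[1:]]

def apply_dictionary(C, y):
    """Compute C y, a combined illumination pattern."""
    return [sum(col[p] * w for col, w in zip(C, y))
            for p in range(len(C[0]))]

# Toy 2-pixel "images" (hypothetical values).
images = [[3.0, 4.0], [0.0, 5.0], [2.0, 0.0]]
C1 = illumination_dictionary(images)
pattern = apply_dictionary(C1, [0.5, 0.5])
print(C1, pattern)
```

With $n$ input images the dictionary has $n - 1$ columns, and any coefficient vector $y$ mixes them into a new illumination difference pattern.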

We need to emphasize here that although the construction of $C_1$ in
(4) is straightforward, it is by no means the only way to obtain an
illumination dictionary. In the literature, many other algorithms
are well known, such as
the  quotient  image  G2na  and  edge-preserving  filters  0. 
The focus of this paper is not on the illumination transfer function
per se, but on how its application on face images can enable
single-sample alignment and recognition under the SRC framework. In
addition, the illumination transfer shown later in (5) can be solved
by efficient $\ell_1$-minimization algorithms. Therefore, it has
speed advantages compared to other more sophisticated methods. This
approach was also used in [6] in the definition of the intraclass
variant dictionary, but only for recognition. We will compare the
performance of the two methods in Section 4.

Another issue with the illumination dictionary is that, if additional
subject classes beyond $L+1$ are provided, one can continue to
construct additional dictionaries $C = [C_1, C_2, \cdots]$. However,
a somewhat unconventional observation we have made during our
experiment is that if the first dictionary $C_1$ is carefully chosen,
a single additional subject class is sufficient to achieve extremely
good performance for face alignment and recognition. In Section 4, we
will show that using a single illumination class, our alignment
accuracy using only one reference image is comparable to that of [24]
using multiple reference images, and the subsequent recognition
accuracy further exceeds those using manual alignment results.

Clearly,  this  singular  subject  needs  to  have  the  facial  ap¬ 
pearance  that  is  close  to  the  “mean  face,”  which  has  been 
used  in  face  recognition  to  refer  to  the  average  appear¬ 
ance  of  faces  over  a  population  0.  On  the  other  hand, 
using  those  examples  with  abnormal  facial  features  such 
as  glasses  and  beard  could  easily  reduce  the  performance. 
Without loss of generality, we assume $C = C_1$ in this paper. In
Section 4.4, we will examine the efficacy of designing different
illumination dictionaries with more subjects.


Figure  1.  Examples  of  the  elements  of  an  illumination  dictionary 
C  constructed  from  the  YaleB  database. 

Nevertheless, given the limited number of training images in
practice, the illumination dictionary itself also cannot be
arbitrarily large. Therefore, an effective solution should be able to
achieve accurate alignment while only relying on a few illumination
samples. Our solution is called sparse illumination transfer (SIT):

$$ \hat{\tau}_i = \arg\min_{x_i, y_i, e, \tau_i} \|y_i\|_1 + \lambda \|e\|_1 \quad \text{subj. to} \quad b \circ \tau_i = a_i x_i + C y_i + e, \tag{5} $$

where $\lambda$ is a parameter that balances the weight of $y_i$ and
$e$, which can be chosen empirically. In our experiment, we found
$\lambda = 1$ generally led to good performance for both uncorrupted
and corrupted cases. Finally, the objective function (5) can be
solved efficiently using $\ell_1$-minimization techniques such as
those discussed in [24, 29].^2 Figure 2 shows two examples of the
alignment results.


Figure  2.  Single-sample  alignment  results  on  Multi-PIE.  The  solid 
red  boxes  are  the  initial  face  locations  provided  by  a  face  detector. 
The dashed green boxes show the alignment results. Left: The subject
wears glasses. Right: The subject image has 30% of the face
pixels  corrupted  by  random  noise. 

3.2.  Single-Sample  Recognition 

Next,  we  propose  a  novel  face  recognition  algorithm  that 
extends  the  SRC  framework  to  the  single-sample  regime. 
Similar  to  the  above  alignment  algorithm,  the  algorithm  also 
applies  trivially  when  multiple  training  samples  per  class 
are  available. 

Given the same reference image $a_i$ as in (5), again we assume $a_i$
is sampled from a random illumination condition. The key idea of our
algorithm is to transfer and apply the estimated image transformation
$\hat{\tau}_i$ and the SIT compensation $C y_i$ directly from the
alignment step (5) to the recognition step. More specifically, for
each reference image $a_i$ of class $i$, define its warped version as

$$ \tilde{a}_i = (a_i x_i + C y_i) \circ \hat{\tau}_i^{-1}. \tag{6} $$

The modified reference image $\tilde{a}_i$ aligns the orientation of
$a_i$ towards the query image, and at the same time adjusts the
appearance of $a_i$ to take into account the transferred illumination
model $C y_i$. Some examples of this effect are shown in Figure 3.
After the SIT is applied to all the training images, we obtain the
following warped training dictionary of $L$ columns:

$$ \tilde{A} = [\tilde{a}_1, \cdots, \tilde{a}_L]. \tag{7} $$

The SIT recognition algorithm solves a sparse representation of the
query image $b$ in the following linear system:

$$ x^* = \arg\min_{x, e} \|x\|_1 + \lambda \|e\|_1 \quad \text{subj. to} \quad b = \tilde{A} x + e, \tag{8} $$

2 In  addition  to  seeking  a  sparse  representation  y,  an  alternative  solution 
could  minimize  the  £2  -norm  of  y  instead,  as  used  in  124113H  .  We  have  also 
tested the variation, and found the difference between the two
solutions to be small, with minimizing $\|y\|_1$ slightly better than
minimizing $\|y\|_2$.


Figure 3. Examples of warping a single reference image $\tilde{a}_i =
(a_i + C y_i) \circ \hat{\tau}_i^{-1}$ for recognition. Left: Query
image $b$. Middle Left: Reference image $a_i$. Middle Right:
Illumination transfer information $C y_i$. Right: Warped reference
$\tilde{a}_i$ has closer pose and illumination to $b$ than the
original image $a_i$.


where the parameter $\lambda$ can be chosen empirically.

In (8), the SIT dictionary $\tilde{A}$ has only $L$ columns,
representing the training images from the $L$ classes, respectively.
As a result, the recognition algorithm to recover the class label of
$b$ can be simplified from the original SRC algorithm [26]: the class
corresponding to the largest coefficient magnitude in $x$ is the
estimated class of the query image $b$. Figure 4 shows the estimated
coefficients of an example of SIT recognition.
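
Because each class contributes exactly one column, the decision rule reduces to picking the largest coefficient magnitude. A minimal sketch, assuming the coefficient vector has already been estimated; the function name `sit_classify`, the labels, and the coefficient values are hypothetical.

```python
def sit_classify(x, labels):
    """Return the label whose coefficient in x has the largest magnitude."""
    if len(x) != len(labels):
        raise ValueError("x needs one coefficient per class")
    best = max(range(len(x)), key=lambda j: abs(x[j]))
    return labels[best]

# Hypothetical coefficients for five single-sample classes.
x = [0.03, -0.01, 0.92, 0.0, -0.05]
labels = ["s1", "s2", "s3", "s4", "s5"]
print(sit_classify(x, labels))  # prints s3
```

The magnitude, not the signed value, is compared, so a large negative coefficient would also select its class.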


Figure 4. Illustration of SIT recognition. Top Left: $b$. Top Right:
$e$. Bottom: Sparse representation $x$ with the correct reference
image $a_i$ superimposed.


Before we move on to examine the performance of the new recognition
algorithm (8), one may question the efficacy of enforcing a sparse
representation in (8). The question may arise because in the original
SRC framework, the data matrix $A = [A_1, \cdots, A_L]$ is a
collection of highly correlated image samples that span the $L$
illumination subspaces. Therefore, it makes sense to enforce a sparse
representation, as also validated by several follow-up
studies  mmm.  However,  in  single- sample  recognition, 


only one sample $a_i$ is provided per class. Therefore, one would
think that the best recognition performance can only be achieved by
the nearest-neighbor algorithm.

There are at least two arguments to justify the use of sparse
representation in (8). On the one hand, as discussed in
|2CL  in  the  case  that  e  is  a  small  dense  error  and  the  nearest- 
neighbor solution corresponds to a one-sparse binary vector $x_0 =
[\cdots, 0, 1, 0, \cdots]^T$ in the formulation (8), then solving (8)
via $\ell_1$-minimization can also recover the sparsest solution,
namely, $x^* \approx x_0$. On the other hand, in the case that $e$
represents a gross image corruption, as long as the elements of
$\tilde{A}$ in (8) remain tightly correlated in the image space, the
$\ell_1$-minimization algorithm can compensate for the dense error in
the query image $b$ [25]. This is a unique advantage over
nearest-neighbor type algorithms.

4.  Experiment 

In  this  section,  we  present  a  comprehensive  experi¬ 
ment  to  demonstrate  the  performance  of  our  alignment  and 
recognition  algorithms.  The  illumination  dictionary  is  con¬ 
structed  from  YaleB  face  database  Col.  YaleB  contains 
5760 single-light-source images of 10 subjects under 9 poses
and  64  illumination  conditions.  For  every  subject  in  a  par¬ 
ticular  pose,  an  image  with  ambient  (background)  illumina¬ 
tion  was  also  captured.  In  our  experiments,  we  only  use  the 
first  subject  with  its  65  aligned  frontal  images  (64  illumina¬ 
tions  +  1  ambient)  to  construct  our  illumination  dictionary. 
The  dictionary  C  is  constructed  by  subtracting  the  ambient 
image from the other 64 illumination images. For a fair comparison,
all the experiments in this section share the same
YaleB  illumination  dictionary. 

For  the  training  and  query  subjects,  we  choose  images 
from  a  much  larger  CMU  Multi-PIE  database  CD.  Except 
for Section 4.3, 166 shared subject classes from Session 1
and  Session  2  are  selected  for  testing.  In  Session  1,  we  ran¬ 
domly  select  one  frontal  image  per  class  with  arbitrary  il¬ 
lumination  as  the  training  image.  Then  we  randomly  select 
two  different  frontal  images  from  Session  1  or  Session  2  for 
testing.  The  outer  eye  corners  of  both  training  and  query 
images are manually marked as the ground truth for registration. All
the training face images are manually cropped to 60 x 60 pixels based
on the locations of the outer eye corners, and the distance between
the two outer eye corners is normalized to be 50 pixels for each
person. We again
emphasize  that  our  experimental  setting  is  more  practical 
than  those  used  in  some  other  publications,  as  we  allow  the 
training  images  to  have  arbitrary  illumination  and  not  nec¬ 
essarily  just  the  ambient  illumination. 
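
The eye-corner normalization described above amounts to a similarity transform. The sketch below computes only the scale and the angle of the eye line from two marked corners; where the corners land inside the 60 x 60 crop is not specified in this paper, so no crop offset is computed. The function name `eye_normalization` and the sample coordinates are hypothetical.

```python
import math

def eye_normalization(left, right, target=50.0):
    """Return (scale, angle_deg) for the outer-eye-corner normalization.

    left, right: (x, y) pixel coordinates of the two outer eye corners.
    scale maps their distance to `target` pixels; angle_deg is the angle
    of the eye line relative to the image x-axis.
    """
    dx, dy = right[0] - left[0], right[1] - left[1]
    dist = math.hypot(dx, dy)
    if dist == 0:
        raise ValueError("eye corners coincide")
    scale = target / dist
    angle_deg = math.degrees(math.atan2(dy, dx))
    return scale, angle_deg

# Hypothetical corner coordinates.
print(eye_normalization((100, 120), (180, 120)))
print(eye_normalization((0, 0), (30, 40)))
```

For the first pair the corners are 80 pixels apart on a level line, so the image would be shrunk by a factor of 0.625 with no rotation.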

We compare our algorithms with several state-of-the-art face
alignment and recognition algorithms under the SRC framework. For the
alignment benchmark, we compare with the deformable SRC (DSRC)
algorithm [24] and the misalignment robust representation (MRR)
algorithm [30]. For the recognition benchmark, we compare with DSRC
and MRR based on the above automatic alignment results to find face
regions. We also compare with the original SRC algorithm [26] and
Extended SRC (ESRC) [6] with the face region location provided by
manual labeling.

4.1.  Simulation  on  2D  Alignment 

We  first  demonstrate  the  performance  of  the  SIT  align¬ 
ment  algorithm  dealing  with  simulated  2D  deformation,  in¬ 
cluding  translation,  rotation  and  scaling.  The  added  de¬ 
formation  is  introduced  to  the  query  images  based  on  the 
ground  truth  coordinates  of  eye  corners.  The  translation 
ranges  from  [-12,  12]  pixels  with  a  step  of  2  pixels.  Sim¬ 
ilar  to  O,  we  use  the  estimated  alignment  error  ||e||i  as 
an indicator of success. More specifically, let $e_0$ be the
alignment error obtained by aligning a query image from the manually
labeled position to the training images. We consider the alignment
successful if $\big| \|e\|_1 - \|e_0\|_1 \big| < 0.01 \|e_0\|_1$.
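
The success criterion is a relative comparison of two $\ell_1$ norms and can be written directly. A minimal sketch, assuming both error images are given as flat lists; the function names `l1_norm` and `alignment_succeeded` and the toy values are hypothetical.

```python
def l1_norm(v):
    """Sum of absolute values of a vectorized error image."""
    return sum(abs(t) for t in v)

def alignment_succeeded(e, e0, tol=0.01):
    """True if | ||e||_1 - ||e0||_1 | < tol * ||e0||_1."""
    n, n0 = l1_norm(e), l1_norm(e0)
    return abs(n - n0) < tol * n0

# e0: error from the manually labeled position (toy values).
e0 = [1.0, -1.0, 2.0]
print(alignment_succeeded([1.0, -1.0, 2.03], e0))  # relative gap 0.75%
print(alignment_succeeded([1.0, -1.0, 2.1], e0))   # relative gap 2.5%
```

The first call passes the 1% tolerance and the second does not.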

We compare our method with DSRC and MRR. As DSRC and MRR require
multiple reference images per class, to provide a fair comparison, we
evaluate both algorithms under two settings. First, seven reference
images are provided per class to DSRC.^3 We denote this case as
DSRC-7. Second, one randomly chosen image per class is provided, the
same setting as the SIT algorithm. We denote this case as DSRC-1 and
MRR-1.

We draw the following observations from the alignment results shown
in Figure 5:

1. SIT works well under a broad range of 2D deformation, particularly
when the translation in the x or y direction is less than 20% of the
eye distance (10 pixels) and when the in-plane rotation is less than
30°.

2. Clearly, SIT outperforms both DSRC-1 and MRR-1 when the same
setting is used, namely, one sample per class. The obvious reason is
that DSRC and MRR were not designed to handle the single-sample
alignment scenario.

3.  SIT  slightly  outperforms  DSRC-7,  where  DSRC-7  has 
access  to  seven  training  images  of  different  illumina¬ 
tion  conditions.  Furthermore,  the  SIT  dictionary  is 
derived  from  a  single  subject  class  from  the  unrelated 
YaleB  database.  It  validates  that  illumination  examples 
of  a  well-chosen  subject  are  sufficient  for  SIT  align¬ 
ment. 

4.2.  Single-Sample  Recognition 

In  this  subsection,  we  evaluate  the  SIT  recognition  algo¬ 
rithm  based  on  single  reference  images  of  the  166  subject 

3 The training images are illuminations {0, 1, 7, 13, 14, 16, 18} in
Multi-PIE Session 1.


classes  shared  in  Multi-PIE  Sessions  1  &  2.  We  compare  its 
performance  with  SRC,  ESRC,  DSRC,  and  MRR. 

First, we note that the new SIT framework and the existing sparse
representation algorithms are not mutually exclusive. In particular,
the illumination transfer (6) can be easily adopted by the other
algorithms to improve the illumination condition of the training
images, especially in the single-sample setting. In the first
experiment, we demonstrate the improvement of SRC and ESRC with the
illumination transfer. Since neither algorithm addresses the
alignment problem, manual labels of the face location are assumed to
be the aligned face location. The comparison is presented in Table 1.

Table 1. Single-sample recognition accuracy via manual alignment.

Method         Session 1 (%)   Session 2 (%)
SRCm           88.0            53.6
ESRCm          89.6            56.6
SRCm + SIT     91.6            59.0
ESRCm + SIT    93.6            59.3

We observe that, since the training images are selected
from Session 1, it is no surprise that the recognition
rates on the testing images from Session 1 are significantly
higher than those on Session 2. The comparison
further shows that adding the illumination transfer information
to the SRC and ESRC algorithms meaningfully improves
their performance by 3% - 4%.
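The gains quoted above can be checked directly against Table 1. The short sketch below is illustrative only: the dictionary `table1` and the helper `sit_gain` are hypothetical names, and the numbers are copied from the table. It suggests that the 3% - 4% figure matches the Session 1 gains, while the Session 2 gains are 5.4 and 2.7 percentage points.

```python
# Recognition accuracy (%) copied from Table 1: (Session 1, Session 2).
table1 = {
    "SRCm": (88.0, 53.6),
    "ESRCm": (89.6, 56.6),
    "SRCm + SIT": (91.6, 59.0),
    "ESRCm + SIT": (93.6, 59.3),
}


def sit_gain(base):
    """Percentage-point gain from adding illumination transfer to `base`."""
    before = table1[base]
    after = table1[base + " + SIT"]
    # Round to one decimal, the precision reported in the table.
    return tuple(round(a - b, 1) for a, b in zip(after, before))


gains = {base: sit_gain(base) for base in ("SRCm", "ESRCm")}
for base, (s1, s2) in gains.items():
    print(f"{base}: +{s1} points (Session 1), +{s2} points (Session 2)")
```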

Second, we compare DSRC, MRR, and SIT in the full
pipeline of alignment plus recognition. The results are
shown in Table 2.

Table 2. Single-sample alignment + recognition accuracy.

Method   Session 1 (%)   Session 2 (%)
DSRC     36.1            35.7
MRR      46.2            34.6
SIT      79.9            65.7

Compared with the past reported results of DSRC and
MRR, their recognition accuracy decreases significantly
when only one training image is available per class. This
demonstrates that these algorithms were not designed to
perform well in the single-sample regime. In both Session 1
and Session 2, SIT outperforms both algorithms by at least
30%. It is more interesting to compare the Session 2
recognition rates in Table 1 and Table 2, since Session 2 is
the more difficult and realistic experiment. SIT, which relies
on a SIT dictionary to automatically align the testing images,
achieves 65.7%, which is even higher than the ESRCm + SIT
rate of 59.3% with manual alignment.

4.3.  Robustness  under  Random  Corruption 

In this subsection, we further evaluate the robustness of
the SIT recognition algorithm to random pixel corruption.
We again compare the overall recognition rate of SIT with
DSRC and MRR, the two most relevant algorithms.

Figure 5. Success rate of face alignment under four types of 2D deformation: x-translation, y-translation, rotation, and scaling. The amount
of translation is expressed in pixels, and the in-plane rotation is expressed in degrees.

To benchmark the recognition under different corruption
percentages, it is important that the query images and the
training images have close facial appearance. Otherwise,
differences in facial features, such as glasses, beards, or
hair styles, would also act as facial corruption or disguise.
To limit this variability, in this experiment we use
Multi-PIE Session 1 for both training and testing, although
the training and testing images never overlap. We use all
the subjects in Session 1 as the training and testing sets.
For each subject, we randomly select one frontal image with
arbitrary illumination for testing. Various levels of image
corruption from 10% to 40% are randomly generated in the
face region. Similar to the previous experiments, the face
regions are detected by the Viola-Jones detector. The
performance of the three algorithms is shown in Table 3.
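The corruption protocol can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes that a corrupted pixel is replaced by a value drawn uniformly from [0, 255], which the text does not specify, and the function name `corrupt_pixels` and the constant stand-in image are hypothetical.

```python
import numpy as np


def corrupt_pixels(face, fraction, rng):
    """Return a copy of `face` with `fraction` of its pixels replaced by noise.

    Assumption: corrupted pixels are drawn uniformly from [0, 255]; the
    text only states that corruption is randomly generated in the face region.
    """
    corrupted = face.copy()
    n_corrupt = int(round(fraction * face.size))
    # Sample distinct positions so exactly n_corrupt pixels are overwritten.
    idx = rng.choice(face.size, size=n_corrupt, replace=False)
    corrupted.flat[idx] = rng.integers(0, 256, size=n_corrupt, dtype=np.uint8)
    return corrupted


rng = np.random.default_rng(0)
face = np.full((60, 50), 128, dtype=np.uint8)  # stand-in for a cropped face
levels = {p: corrupt_pixels(face, p, rng) for p in (0.1, 0.2, 0.3, 0.4)}
```

Sampling positions without replacement keeps the corruption level exact. A few overwritten pixels may receive their original value by chance, so the number of visibly changed pixels can be slightly below the nominal level.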


Table 3. Recognition rates (%) under various random corruption.

Corruption   10%    20%    30%    40%
DSRC         32.9   31.7   28.9   24.1
MRR          24.9   14.5   11.7    9.2
SIT          74.3   70.3   67.1   55.8

The comparison is more illustrative than that in Table 2.
For instance, with 40% pixel corruption, SIT still maintains 56%
accuracy; with 10% corruption, SIT outperforms DSRC and
MRR by more than 40%.

4.4.  Multiple-Subject  SIT  Dictionaries 

The last topic of our discussion is the effect of choosing
multiple subject classes for building the SIT dictionary, as
we previously mentioned. In the above alignment and
recognition comparisons, we have seen that SIT is comparable
to or outperforms the existing face recognition algorithms
using just a one-subject illumination dictionary. In
this experiment, we provide some empirical observations on
how its alignment accuracy changes when going from
one subject to 10 subjects. Figure 6 again shows the
alignment success rates when the face bounding box
undergoes x-axis and y-axis translation, respectively, within
[-12, 12] pixels.
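The translation sweep used in this experiment can be sketched as below. This is an illustrative sketch under stated assumptions: the `(x, y, width, height)` box layout, the 2-pixel step, and the names `translate_box` and `translation_sweep` are hypothetical, and only the [-12, 12] pixel range comes from the text.

```python
def translate_box(box, dx=0, dy=0):
    """Shift an (x, y, width, height) bounding box by (dx, dy) pixels."""
    x, y, w, h = box
    return (x + dx, y + dy, w, h)


def translation_sweep(box, max_shift=12, step=2):
    """Perturbed boxes for x-only and y-only shifts in [-max_shift, max_shift].

    Assumption: the step size is a free parameter; the text gives only
    the range of the translation.
    """
    shifts = range(-max_shift, max_shift + 1, step)
    x_only = {d: translate_box(box, dx=d) for d in shifts}
    y_only = {d: translate_box(box, dy=d) for d in shifts}
    return x_only, y_only


# Example: perturb a hypothetical detector output before running alignment.
x_only, y_only = translation_sweep((100, 80, 60, 60))
```

Each perturbed box would then serve as the alignment initialization, and the success rate at a given shift is the fraction of test images whose alignment converges correctly from that starting point.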


Figure 6. SIT alignment success rates from one to 10 subjects. (Horizontal axis: x-translation only, in pixels.)

We observe that adjusting the size of the illumination
dictionary does affect the alignment performance. However,
the performance does not increase monotonically with more
subject classes. In particular, for x-translation, all
dictionaries are able to maintain good performance (above 98%
recognition rate) even when the translation is as large as
±10 pixels. For y-translation, the single-sample illumination
dictionary slightly outperforms the others with more
subjects when the translation is large.


5.  Conclusion  and  Discussion 

In this paper, we have presented a novel face recognition
algorithm specifically designed for single-sample alignment
and recognition. Although we have provided some exciting
results that represent a meaningful step forward towards
a real-world face recognition system, there remain several
open problems that warrant further investigation. First,
although the current way of constructing the illumination
dictionary is efficient, the method is not able to separate the
effect of surface albedo, shape, and illumination completely
on face images. Therefore, a more sophisticated illumination
transfer algorithm could lead to better overall performance.
Second, although we have demonstrated empirically
in Section 4.4 that including more subjects in the
illumination dictionary may not necessarily lead to better
performance, one could study whether a better dictionary
learning algorithm could be applied to formulate the
illumination dictionary that might represent more face shapes and
illumination patterns.